Full DS size: 307511
Out[25]:
Total NaN Values Proportion NaN (%)
PrevRatioRejectedAccepted 16847 5.0

Exploratory Analysis

This notebook covers the analysis of selected variables (chosen by their importance in predicting the target variable) and the relationships between them. An individual analysis of each variable is available in the EDA_appendices notebook.

Dataset Summary

NaN Values by Column:

Out[26]:
Total NaN Values Proportion NaN (%)
ExtSource2 660 0.0
ExtSource3 60965 20.0
ExtSource1 173378 56.0
AmtGoodsPrice 278 0.0
OwnCarAge 202929 66.0
PrevAmtDownPaymentSum 16454 5.0
AmtAnnuity 12 0.0
MeanbureaudaysCredit 44020 14.0
MeanbureauamtCreditSumDebt 51380 17.0
PrevAvgYieldGroup 18945 6.0
PrevCreditReceivedRequestedDiff 16454 5.0
OccupationType 96391 31.0
PrevRatioRejectedAccepted 16847 5.0
MaxbureaudaysCreditEnddate 46269 15.0
PrevLastLoanGoodsCategory 16454 5.0
MeanbureauamtCreditMaxOverdue 123625 40.0
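A summary like the one above can be produced with a few lines of pandas. This is a minimal sketch rather than the notebook's actual code: the helper name `nan_summary` and the toy frame are assumptions made for illustration, and the rounding to whole percent is inferred from the precision shown in the table.

```python
import numpy as np
import pandas as pd


def nan_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a share of all rows."""
    total = df.isna().sum()
    summary = pd.DataFrame(
        {
            "Total NaN Values": total,
            # Rounded to whole percent (assumed to match the table's precision).
            "Proportion NaN (%)": (100 * total / len(df)).round(0),
        }
    )
    # Keep only the columns that actually contain missing values.
    return summary[summary["Total NaN Values"] > 0]


# Toy data for illustration only (not the Home Credit dataset).
toy = pd.DataFrame(
    {
        "ExtSource1": [0.5, np.nan, np.nan, 0.2],
        "AmtCredit": [1000.0, 2000.0, 1500.0, 3000.0],
        "OwnCarAge": [np.nan, 3.0, 10.0, 7.0],
    }
)
toy_summary = nan_summary(toy)
# Fully duplicated rows can be counted in the same pass.
toy_duplicates = int(toy.duplicated().sum())
print(toy_summary)
print(f"Duplicated Values: {toy_duplicates}")
```

Rounding to whole percent hides small but non-zero shares (for example, 660 missing values out of 307511 rows display as 0.0), so one more decimal place may be worth keeping when the distinction matters.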
Duplicate Values
'Duplicated Values: 0'
'Total Columns: 229'

Correlations

Because we have such a large number of columns, we have only included features which have an importance value { > X } in our final LGBM model: TODO

[Figure: correlation matrix of the selected features; image not available]

The TARGET variable (loans with payment difficulties) is most correlated with credit ratings obtained from external sources. The correlation is very weak but still statistically significant.

Because the data types of the features vary, we had to use different methods to measure the strength and significance of each pair:

  • Chi-Squared Test: Assesses independence between two categorical variables. Used for bool-bool pairs because both variables are categorical.

  • Point Biserial Correlation: Measures the correlation between a binary and a continuous variable. Used for bool-numerical pairs to account for the mixed data types.

  • Spearman's Rank Correlation: Assesses the monotonic relationship between two continuous variables. Used for numerical-numerical pairs because it does not assume normally distributed data.

Since the Chi-Squared test outputs an unbounded statistic which can't be directly compared to pointbiserialr or Spearman's rank coefficients, we have converted it to Cramér's V, which is normalized between 0 and 1. This was done to make the values in the matrix more uniform. However, we must note that Cramér's V and Spearman's correlation coefficient are fundamentally different statistics and generally can't be directly compared.
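The dispatch between the three measures can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in shared/graph.py: the helper names `cramers_v` and `pair_association` are hypothetical, a column is treated as "bool" when it has exactly two distinct values, and binary columns are assumed to be coded 0/1.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cramers_v(x: pd.Series, y: pd.Series) -> tuple[float, float]:
    """Return (Cramér's V, chi-squared p-value) for two categorical series."""
    table = pd.crosstab(x, y)
    chi2, p_value, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    # V = sqrt(chi2 / (n * (min(rows, cols) - 1))), bounded between 0 and 1.
    return float(np.sqrt(chi2 / (n * min_dim))), float(p_value)


def pair_association(a: pd.Series, b: pd.Series) -> tuple[str, float, float]:
    """Pick a measure based on the pair's types; return (method, value, p-value)."""
    pair = pd.concat([a, b], axis=1).dropna()
    a, b = pair.iloc[:, 0], pair.iloc[:, 1]
    # Assumption: exactly two distinct values means a binary (bool) column.
    a_bool, b_bool = a.nunique() == 2, b.nunique() == 2
    if a_bool and b_bool:
        value, p_value = cramers_v(a, b)
        return "cramers_v", value, p_value
    if a_bool or b_bool:
        binary, numeric = (a, b) if a_bool else (b, a)
        r, p_value = stats.pointbiserialr(binary.astype(float), numeric.astype(float))
        return "point_biserial", float(r), float(p_value)
    rho, p_value = stats.spearmanr(a, b)
    return "spearman", float(rho), float(p_value)


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
flag = pd.Series(rng.integers(0, 2, 500), name="flag")
other_flag = (1 - flag).rename("other_flag")  # perfectly (inversely) associated
score = pd.Series(rng.normal(size=500) + 2 * flag, name="score")
cubed = (score**3).rename("cubed")  # monotonic transform of score

bool_bool = pair_association(flag, other_flag)
bool_num = pair_association(flag, score)
num_num = pair_association(score, cubed)
print(bool_bool[0], bool_num[0], num_num[0])
```

Note that Cramér's V carries no sign, so a matrix that mixes it with signed coefficients can show direction only for the point-biserial and Spearman cells.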

Correlation With the Target Variable

Our target variable TARGET shows whether the given application had any late payments (value = 1). We can see that no single feature is strongly correlated with it:

Out[31]:
Column Coefficient P-Value
ExtSource3 -0.161 0.000
ExtSource1 -0.131 0.000
ExtSource2 -0.128 0.000
MeanbureaudaysCredit 0.093 0.000
OccupationType 0.075 0.000
DaysEmployed 0.074 0.000
PrevRatioRejectedAccepted 0.073 0.000
PrevRatioRejectedAccepted_cats_2 0.072 0.000
PrevRatioRejectedAccepted_cats 0.072 0.000
OrganizationType 0.069 0.000
NameEducationType 0.067 0.000
PrevAmtDownPaymentSum -0.057 0.000
PrevCreditReceivedRequestedDiff 0.055 0.000
DaysBirth 0.053 0.000
PrevLastLoanGoodsCategory 0.051 0.000
OwnCarAge 0.050 0.000
MeanbureauamtCreditSumDebt 0.049 0.000
MeanbureauamtCreditMaxOverdue 0.044 0.000
DaysIdPublish 0.042 0.000
CodeGender 0.041 0.000
PrevAvgYieldGroup 0.040 0.000
FlagDocument3 0.039 0.000
AmtGoodsPrice -0.034 0.000
MaxbureaudaysCreditEnddate 0.034 0.000
NameFamilyStatus 0.027 0.002
AmtCredit -0.023 0.001
AmtAnnuity 0.003 0.664

The chart below shows the relationship between selected categorical variables and loan status. For example, a significantly higher proportion of loans taken out by males had payment difficulties.

[Figure: loan status by selected categorical variables; image not available]
Out[38]:
CategoricalDtype(categories=['< 25% Rejected', '> 25% Rejected', 'All Accepted', 'No Previous App.'], ordered=False, categories_dtype=object)

Relationships Between Numerical and Categorical Variables

The charts below show pairs of numerical and categorical features (including some binned numerical features) that have a significant relationship and at least a small effect size (eta_squared > 0.01) based on the non-parametric Kruskal-Wallis test (one-way ANOVA on ranks), which tests whether the samples originate from the same distribution.

*It's similar to the Mann–Whitney U test but allows comparing more than 2 groups.
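As a rough illustration of the filter described above, the sketch below runs the Kruskal-Wallis test with SciPy and derives an effect size from the H statistic. The notebook does not show how eta_squared was computed, so the commonly used formula eta² = (H - k + 1) / (n - k) is an assumption here, as are the helper name `kruskal_effect` and the toy columns.

```python
import pandas as pd
from scipy import stats


def kruskal_effect(df: pd.DataFrame, numeric_col: str, group_col: str):
    """Return (H statistic, p-value, eta squared) for one numeric/categorical pair."""
    clean = df[[numeric_col, group_col]].dropna()
    # One array of values per category; observed=True skips unused categories.
    groups = [
        values.to_numpy()
        for _, values in clean.groupby(group_col, observed=True)[numeric_col]
    ]
    h_stat, p_value = stats.kruskal(*groups)
    k, n = len(groups), len(clean)
    # Assumed effect-size formula based on the H statistic.
    eta_squared = (h_stat - k + 1) / (n - k)
    return float(h_stat), float(p_value), float(eta_squared)


# Toy data for illustration only: three clearly separated groups.
toy = pd.DataFrame(
    {
        "amount": list(range(0, 10)) + list(range(100, 110)) + list(range(200, 210)),
        "group": ["a"] * 10 + ["b"] * 10 + ["c"] * 10,
    }
)
h_stat, p_value, eta_squared = kruskal_effect(toy, "amount", "group")
keep_pair = p_value < 0.05 and eta_squared > 0.01
print(round(h_stat, 3), round(eta_squared, 3), keep_pair)
```

With several hundred thousand rows almost any difference is statistically significant, which is why an effect-size threshold is applied in addition to the p-value.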

[Figures: one chart per numerical-categorical feature pair; images not available]

Analyzing Credit Scores (ExtSource1)

ExtSource1/2/3 are the variables most strongly correlated with the target variable. They indicate client credit scores obtained from external sources. While the correlation coefficients are very low (between 0.13 and 0.16 in absolute value), we'll look a bit more into these scores because credit ratings usually tend to be the most useful metric when estimating the risk of specific loans:

Summary for combined model:
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 TARGET   No. Observations:               109589
Model:                          Logit   Df Residuals:                   109585
Method:                           MLE   Df Model:                            3
Date:                Mon, 29 Apr 2024   Pseudo R-squ.:                  0.1047
Time:                        19:57:19   Log-Likelihood:                -25636.
converged:                       True   LL-Null:                       -28634.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6002      0.040     14.829      0.000       0.521       0.680
ExtSource1    -2.0989      0.061    -34.382      0.000      -2.219      -1.979
ExtSource2    -1.9640      0.060    -32.654      0.000      -2.082      -1.846
ExtSource3    -2.7793      0.062    -44.483      0.000      -2.902      -2.657
==============================================================================

This is a simple logistic model that uses only the credit scores to estimate the target variable. The confidence interval shows the standard deviation of the residuals from the combined logistic regression model (residuals in this context are the differences between the observed values, y_combined, and the predicted probabilities).

Generally, the explained variability (Pseudo R-squared) is quite low at only 0.1047; however, the model itself is statistically significant (LLR p-value reported as 0.000).
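Both figures can be reproduced from the log-likelihoods in the summary. The sketch below assumes McFadden's pseudo R-squared, which is what statsmodels reports for Logit, and uses the rounded values printed above, so the last digits may differ slightly from the exact fit.

```python
from scipy import stats

# Rounded values copied from the regression summary above.
LL_MODEL = -25636.0  # Log-Likelihood of the fitted model
LL_NULL = -28634.0   # LL-Null (intercept-only model)
DF_MODEL = 3         # number of predictors

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
pseudo_r2 = 1.0 - LL_MODEL / LL_NULL

# Likelihood-ratio statistic, compared against a chi-squared distribution.
llr_stat = 2.0 * (LL_MODEL - LL_NULL)
llr_p_value = stats.chi2.sf(llr_stat, DF_MODEL)

print(f"Pseudo R-squared: {pseudo_r2:.4f}")
print(f"LLR statistic: {llr_stat:.0f}")
```

The likelihood-ratio statistic is so large that the p-value is far below any conventional threshold, which is why the summary prints it as 0.000.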

Out[41]:
Coefficient Standard Error P-Value Conf. Interval Lower Conf. Interval Upper
const 0.600 0.040 0.0 0.521 0.680
ExtSource1 -2.099 0.061 0.0 -2.219 -1.979
ExtSource2 -1.964 0.060 0.0 -2.082 -1.846
ExtSource3 -2.779 0.062 0.0 -2.902 -2.657

Normalized credit ratings from three sources are inversely related to default risk, with ExtSource3 having the strongest influence. We can see that a basic Logistic model can already provide a reasonably high result (AUC = 0.74). However, we have to note that the results are based on the full training set and are only provided for EDA/feature analysis purposes. Full statistical modelling will be done in further sections.
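To make the coefficients more tangible, the sketch below plugs them into the logistic function and also shows a rank-based way to compute AUC. It is only an illustration: the function names are hypothetical, the scores are assumed to lie on a 0 to 1 scale, and the toy labels and scores in the AUC example are made up rather than taken from the dataset.

```python
import numpy as np
from scipy import stats

# Coefficients copied from the regression summary above.
COEFS = {
    "const": 0.6002,
    "ExtSource1": -2.0989,
    "ExtSource2": -1.9640,
    "ExtSource3": -2.7793,
}


def predicted_probability(ext1: float, ext2: float, ext3: float) -> float:
    """Probability of payment difficulties implied by the fitted coefficients."""
    log_odds = (
        COEFS["const"]
        + COEFS["ExtSource1"] * ext1
        + COEFS["ExtSource2"] * ext2
        + COEFS["ExtSource3"] * ext3
    )
    return float(1.0 / (1.0 + np.exp(-log_odds)))


def rank_auc(y_true, scores) -> float:
    """AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks."""
    y_true = np.asarray(y_true).astype(bool)
    ranks = stats.rankdata(scores)
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    return float((ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


# Assumed score values, chosen only to show the direction of the effect.
p_mid = predicted_probability(0.5, 0.5, 0.5)
p_low = predicted_probability(0.2, 0.2, 0.2)
print(f"scores 0.5: {p_mid:.3f}, scores 0.2: {p_low:.3f}")

# Toy labels and scores for the AUC helper.
toy_auc = rank_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Because all three coefficients are negative, lowering any score raises the predicted probability, which is the inverse relationship described above.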

[Figures: density plots of the external credit scores for loans with and without payment difficulties; images not available]

We can see that while the external credit scores are clearly related to default risk, their explanatory power is somewhat limited because there is still a large amount of overlap (especially for ExtSource2; however, its coefficient in our logistic model is similar to that of ExtSource1).

Previous Application History

Had any clients previously applied for loans with Home Credit, and what were the outcomes of their applications?

Out[21]:
PrevRatioRejectedAccepted_cats
All Accepted        190370
> 25% Rejected       66215
< 25% Rejected       34079
No Previous App.     16847
Name: count, dtype: int64

Did any applicants default on any previous loans?

Out[22]:
TotalDefaults_cats
No Defaults          304114
1 Defaulted Loans      3397
Name: count, dtype: int64

Surprisingly, we can see that ~1% of all applicants who were granted a loan have previously had payment difficulties with a previous loan at Home Credit. This is quite interesting considering that credit institutions are generally reluctant to offer loans again to problematic clients.

Total "Defaults"/Loans With Payment Difficulties per applicant:

Out[73]:
TotalDefaults count proportion
0 0.0 304114 0.99
1 1.0 3177 0.01
2 2.0 163 0.00
3 3.0 38 0.00
4 4.0 11 0.00
5 5.0 4 0.00
6 6.0 3 0.00
7 7.0 1 0.00

Previous Loan History and Default Risk

The chart below shows the default rate based on whether the applicant has previously applied for loans with Home Credit:

  • No Previous App. - no previous applications found for the client (i.e. new clients)

  • All Accepted - all previous applications were accepted

  • < 25% Rejected - less than 1/4 of applications were rejected

  • > 25% Rejected - more than 1/4 of applications were rejected
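The binning into these four categories and the per-category default rate could look roughly like the sketch below. It is not the notebook's code: the column name `rejected_share` is hypothetical and assumed to hold the fraction of previous applications that were rejected (missing for new clients), and assigning a share of exactly 25% to the upper bin is also an assumption, since the definitions above do not cover that case.

```python
import numpy as np
import pandas as pd

CATEGORIES = ["No Previous App.", "All Accepted", "< 25% Rejected", "> 25% Rejected"]


def categorize_rejection_share(share: pd.Series) -> pd.Series:
    """Map a rejection share in [0, 1] (NaN = no history) to the four categories."""
    # Conditions are checked in order; the first match wins.
    conditions = [share.isna(), share == 0, share < 0.25]
    labels = np.select(conditions, CATEGORIES[:3], default=CATEGORIES[3])
    return pd.Series(pd.Categorical(labels, categories=CATEGORIES), index=share.index)


# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "rejected_share": [np.nan, 0.0, 0.1, 0.5, 0.0, 0.4],
        "TARGET": [0, 0, 0, 1, 1, 1],
    }
)
toy["category"] = categorize_rejection_share(toy["rejected_share"])

# Default rate per category: the mean of the binary TARGET column.
default_rate = toy.groupby("category", observed=True)["TARGET"].mean()
print(default_rate)
```

Passing `observed=True` explicitly also avoids the pandas FutureWarning about the changing default for categorical groupers.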

Interestingly, we can see that while applicants whose previous applications were rejected are significantly more likely to default when finally given a loan, previous clients with no failed applications have a higher default risk than new clients.

This likely limits the usefulness of the previous_application table because only a minority of clients (about a third, based on the counts above) have previously rejected applications.

[Figure: default rate by previous application outcome; image not available]
[Figures: density plots comparing applicants with and without previous rejections (AnyPreviousRejections) and with and without previous defaults (AnyPreviousDefaults); images not available]

We can clearly see that clients who had run into payment issues with their past loans tend to have a significantly lower ExtSource3 score; however, there is almost no difference in the other scores. This suggests that the data from Home Credit itself is only included in the third rating (which might explain its higher explanatory power in our logistic model).

AmtIncomeTotal
AmtCredit
AmtAnnuity
AmtGoodsPrice
AmtReqCreditBureauHour
AmtReqCreditBureauDay
AmtReqCreditBureauWeek
AmtReqCreditBureauMon
AmtReqCreditBureauQrt
AmtReqCreditBureauYear
MaxbureauamtAnnuity
MaxbureauamtCreditMaxOverdue
MaxbureauamtCreditSum
MaxbureauamtCreditSumDebt
MaxbureauamtCreditSumLimit
MaxbureauamtCreditSumOverdue
MeanbureauamtAnnuity
MeanbureauamtCreditMaxOverdue
MeanbureauamtCreditSum
MeanbureauamtCreditSumDebt
MeanbureauamtCreditSumLimit
MeanbureauamtCreditSumOverdue
MinbureauamtAnnuity
MinbureauamtCreditMaxOverdue
MinbureauamtCreditSum
MinbureauamtCreditSumDebt
MinbureauamtCreditSumLimit
MinbureauamtCreditSumOverdue
PrevAmtApplicationMean
PrevAmtApplicationSum
PrevAmtCreditMean
PrevAmtCreditSum
PrevAmtDownPaymentSum

Loan Purposes

[Figures: loan purpose charts, including the share of loans with payment difficulties by contract type (NameContractType); images not available]

EDA Summary

The EDA was performed in parallel with feature engineering (aggregation of non-main tables) and building an initial LGBM model (using all features). To minimize unnecessary complexity, only features which have some importance { > X } are included.